In [ ]:
 

Mel Schwan, Stuart Miller, Justin Howard, Paul Adams

Lab One: Visualization and Data Preprocessing

CRISP-DM stands for the Cross-Industry Standard Process for Data Mining. It provides a structured approach to planning a data mining project. It is a robust and well-proven methodology.

For Lab One we will use the first two phases of this methodology. We will start by focusing on understanding the objectives and requirements from a business perspective, then use this knowledge to define the data problem and project plan. Next, we will identify data quality issues, discover initial insights, and detect interesting nuggets of information that might form a hypothesis for analysis.

In future labs we will execute all of the CRISP-DM steps.

1. Stage One - Determine Business Objectives and Assess the Situation

The first stage of the CRISP-DM process is to understand what you want to accomplish from a business perspective. We will define our objectives and constraints that must be properly balanced. The goal of this stage of the process is to uncover important factors that could influence the outcome of our project.

1.1 Assess the Current Situation

This analysis of Home Credit's Default Risk dataset will focus on generating accurate loan default risk probabilities. Predicting loan defaults is essential to the profitability of banks and, given the competitive nature of the loan market, a bank that collects the right data can offer and service more loans. The target variable of the dataset is the binary label, 'TARGET', indicating whether the loan entered into default status or not.

1.1.1. Inventory of resources

List the resources available to the project including:

  • Personnel:
    • Mel Schwan
    • Stuart Miller
    • Justin Howard
    • Paul Adams
  • Data:
    • 307,511 individual loans
    • application_{train|test}.csv
    • bureau.csv
    • bureau_balance.csv
    • POS_CASH_balance.csv
    • credit_card_balance.csv
    • previous_application.csv
    • installments_payments.csv
    • HomeCredit_columns_description.csv
  • Computing resources:
    • Four personal computers
  • Software:
    • Jupyter Notebook
    • Library Packages
      • Seaborn
      • pandas
      • NumPy
      • Matplotlib

1.1.2. Requirements, assumptions and constraints

  • Requirements
    • Perform analysis of a data set: explore the statistical summaries of the features, visualize the attributes, and draw conclusions from the visualizations and analysis. Follow the CRISP-DM framework in the analysis (not all of the CRISP-DM outline is performed)
    • Demonstrate the team's understanding of Home Credit's business and the data model to best predict default on loans
  • Assumptions

    • Outlier assumptions
    • Missing data assumptions
    • Joined dataset assumptions
  • Constraints

    • Additional datasets
    • Data

1.1.3. Risks and contingencies

  • Imputation of data
  • Elimination of features

1.1.4. Terminology

  • A glossary of relevant business terminology
  • A glossary of data mining terminology, illustrated with examples relevant to the business problem in question.

1.1.5. Costs and benefits

  • Cost

    • Many people struggle to get loans due to insufficient or non-existent credit histories. And, unfortunately, this population is often taken advantage of by untrustworthy lenders.
  • Benefits

    • Home Credit strives to broaden financial inclusion for the unbanked population by providing a positive and safe borrowing experience. In order to make sure this underserved population has a positive loan experience, Home Credit makes use of a variety of alternative data--including telco and transactional information--to predict their clients' repayment abilities.

1.2 What are the desired outputs of the project?

Business success criteria

  • Reduce default rates to five percent

Data mining success criteria

  • Build a data model that achieves an ROC value of ?

Produce project plan

  • 1.0 Business Understanding
    • Describe the purpose of the data set you selected
    • Describe how you would define and measure the outcomes from the dataset
    • How would you measure the effectiveness of a good prediction algorithm
  • 2.0 Data Understanding
    • Describe the meaning and type of data for each attribute in the data file
    • Verify data quality: Explain any missing values, duplicate data, and outliers
    • Give simple, appropriate statistics (range, mode, mean, median, variance, counts, etc.) for the most important attributes and describe what they mean
    • Visualize the most important attributes appropriately (at least 5 attributes)
    • Explore relationships between attributes: Look at the attributes via scatter plots, correlation, cross-tabulation, group-wise averages, etc. as appropriate
    • Identify and explain interesting relationships between features and the class you are trying to predict
    • Are there other features that could be added to the data or created from existing features? Which ones?

1.3 What Questions Are We Trying To Answer?

Use Home Credit's current customer data to predict whether a potential client is capable of repaying the requested loan. During this process we will determine the features that are most influential in determining this target variable.

2. Stage Two - Data Understanding

The second stage of the CRISP-DM process requires you to acquire the data listed in the project resources. This initial collection includes data loading, if this is necessary for data understanding. For example, if you use a specific tool for data understanding, it makes perfect sense to load your data into this tool. If you acquire multiple data sources then you need to consider how and when you're going to integrate these.

2.1 Initial Data Report

Initial data collection report - List the data sources acquired together with their locations, the methods used to acquire them and any problems encountered. Record problems you encountered and any resolutions achieved. This will help both with future replication of this project and with the execution of similar future projects.

In [2]:
# Import Libraries Required.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns

# import custom code
from cleaning import read_clean_data, missing_values_table, load_bureau, create_newFeatures, merge_newFeatures 

# some defaults
pd_max_rows_default = 60
In [3]:
# load data 
# path =  './application_train.csv'
# note that XNA is an encoding for NA; interpret it as np.nan
df =  pd.read_csv('./application_train.csv',
                 na_values = ['XNA'])
#loading bureau dataset

bureau = pd.read_csv('./bureau.csv',
                    na_values = ['XNA'])
load_bureau()
Out[3]:
SK_ID_CURR SK_ID_BUREAU CREDIT_ACTIVE CREDIT_CURRENCY DAYS_CREDIT CREDIT_DAY_OVERDUE DAYS_CREDIT_ENDDATE DAYS_ENDDATE_FACT AMT_CREDIT_MAX_OVERDUE CNT_CREDIT_PROLONG AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE CREDIT_TYPE DAYS_CREDIT_UPDATE AMT_ANNUITY
0 215354 5714462 Closed currency 1 -497 0 -153.0 -153.0 0.0 0 91323.00 0.0 0.0 0.0 Consumer credit -131 0.0
1 215354 5714463 Active currency 1 -208 0 1075.0 0.0 0.0 0 225000.00 171342.0 0.0 0.0 Credit card -20 0.0
2 215354 5714464 Active currency 1 -203 0 528.0 0.0 0.0 0 464323.50 0.0 0.0 0.0 Consumer credit -16 0.0
3 215354 5714465 Active currency 1 -203 0 0.0 0.0 0.0 0 90000.00 0.0 0.0 0.0 Credit card -16 0.0
4 215354 5714466 Active currency 1 -629 0 1197.0 0.0 77674.5 0 2700000.00 0.0 0.0 0.0 Consumer credit -21 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1716423 259355 5057750 Active currency 1 -44 0 -30.0 0.0 0.0 0 11250.00 11250.0 0.0 0.0 Microloan -19 0.0
1716424 100044 5057754 Closed currency 1 -2648 0 -2433.0 -2493.0 5476.5 0 38130.84 0.0 0.0 0.0 Consumer credit -2493 0.0
1716425 100044 5057762 Closed currency 1 -1809 0 -1628.0 -970.0 0.0 0 15570.00 0.0 0.0 0.0 Consumer credit -967 0.0
1716426 246829 5057770 Closed currency 1 -1878 0 -1513.0 -1513.0 0.0 0 36000.00 0.0 0.0 0.0 Consumer credit -1508 0.0
1716427 246829 5057778 Closed currency 1 -463 0 0.0 -387.0 0.0 0 22500.00 0.0 0.0 0.0 Microloan -387 0.0

1716428 rows × 17 columns

In [4]:
#engineering features from bureau dataset
newFeatures = create_newFeatures(bureau)
newFeatures.head()
Out[4]:
LOAN_COUNT CREDIT_ACTIVE CREDIT_DAY_OVERDUE AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE
SK_ID_CURR
100001 7 3.0 0.0 884025.000 596686.5 0.000 0.0
100002 8 2.0 0.0 481988.565 245781.0 31988.565 0.0
100003 4 1.0 0.0 810000.000 0.0 810000.000 0.0
100004 2 0.0 0.0 0.000 0.0 0.000 0.0
100005 3 2.0 0.0 598626.000 568408.5 0.000 0.0
In [5]:
# load data 
# path =  './application_train.csv'
# note that XNA is a encoding for NA interpret as np.nan
df =  pd.read_csv('./application_train.csv',
                 na_values = ['XNA'])
#loading bureau dataset

bureau = pd.read_csv('./bureau.csv',
                    na_values = ['XNA'])
#load_bureau()

2.2 Describe Data

Data description report - Describe the data that has been acquired including its format, its quantity (for example, the number of records and fields in each table), the identities of the fields and any other surface features which have been discovered. Evaluate whether the data acquired satisfies your requirements.

We will use two of the files from the total dataset.

  • application_train.csv: Information provided with each loan application
  • bureau.csv: Information regarding clients from the credit bureaus

Loan Application Features

There are 122 features and 307511 observations in application_train.csv.

In [6]:
df.shape
Out[6]:
(307511, 122)

The following features are loan application attributes from application_train.csv. Descriptions, types, and units are given for each feature.

Feature Description Type Units
SK_ID_CURR ID of loan in our sample Category N/A
TARGET Target Variable (1 - difficulty paying loan, 0 - all other cases) Category N/A
NAME_CONTRACT_TYPE Identification if loan is cash or revolving Category N/A
CODE_GENDER Gender of the client (M - male, F - female) Category N/A
FLAG_OWN_CAR Flag if the client owns a car Category N/A
FLAG_OWN_REALTY Flag if client owns a house or flat Category N/A
CNT_CHILDREN Number of children the client has Continuous N/A
AMT_INCOME_TOTAL Income of the client Continuous Currency
AMT_CREDIT Credit amount of the loan Continuous Currency
AMT_ANNUITY Loan annuity Continuous Currency
AMT_GOODS_PRICE For consumer loans it is the price of the goods for which the loan is given Continuous Currency
NAME_TYPE_SUITE Who was accompanying client when he was applying for the loan Category N/A
NAME_INCOME_TYPE Clients income type (businessman, working, maternity leave) Category N/A
NAME_EDUCATION_TYPE Level of highest education the client achieved Category N/A
NAME_FAMILY_STATUS Family status of the client Category N/A
NAME_HOUSING_TYPE What is the housing situation of the client (renting, living with parents, ...) Category N/A
REGION_POPULATION_RELATIVE Normalized population of region where client lives (higher number means the client lives in more populated region) Continuous N/A
DAYS_BIRTH Client's age in days at the time of application Continuous Days
DAYS_EMPLOYED How many days before the application the person started current employment Continuous Days
DAYS_REGISTRATION How many days before the application did client change his registration Continuous Days
DAYS_ID_PUBLISH How many days before the application did client change the identity document with which he applied for the loan Continuous Days
OWN_CAR_AGE Age of client's car Continuous Years
FLAG_MOBIL Did client provide mobile phone (Y, N) Category N/A
FLAG_EMP_PHONE Did client provide work phone (Y, N) Category N/A
FLAG_WORK_PHONE Did client provide home phone (Y, N) Category N/A
FLAG_CONT_MOBILE Was mobile phone reachable (Y, N) Category N/A
FLAG_PHONE Did client provide home phone (Y, N) Category N/A
FLAG_EMAIL Did client provide email (Y, N) Category N/A
CNT_FAM_MEMBERS How many family members does client have Continuous N/A
OCCUPATION_TYPE What kind of occupation does the client have Category N/A
REGION_RATING_CLIENT Our rating of the region where client lives (1,2,3) Category N/A
REGION_RATING_CLIENT_W_CITY Our rating of the region where client lives with taking city into account (1,2,3) Category N/A
WEEKDAY_APPR_PROCESS_START On which day of the week did the client apply for the loan Category N/A
HOUR_APPR_PROCESS_START Approximately at what hour did the client apply for the loan Category N/A
REG_REGION_NOT_LIVE_REGION Flag if client's permanent address does not match contact address (1=different, 0=same, at region level) Category N/A
REG_REGION_NOT_WORK_REGION Flag if client's permanent address does not match work address (1=different, 0=same, at region level) Category N/A
LIVE_REGION_NOT_WORK_REGION Flag if client's contact address does not match work address (1=different, 0=same, at region level) Category N/A
REG_CITY_NOT_LIVE_CITY Flag if client's permanent address does not match contact address (1=different, 0=same, at city level) Category N/A
REG_CITY_NOT_WORK_CITY Flag if client's permanent address does not match work address (1=different, 0=same, at city level) Category N/A
LIVE_CITY_NOT_WORK_CITY Flag if client's contact address does not match work address (1=different, 0=same, at city level) Category N/A
ORGANIZATION_TYPE Type of organization where client works Category N/A
EXT_SOURCE_1 Normalized score from external data source Continuous N/A
EXT_SOURCE_2 Normalized score from external data source Continuous N/A
EXT_SOURCE_3 Normalized score from external data source Continuous N/A
OBS_30_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings with observable 30 DPD (days past due) default Continuous N/A
DEF_30_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings defaulted on 30 DPD (days past due) Continuous N/A
OBS_60_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings with observable 60 DPD (days past due) default Continuous N/A
DEF_60_CNT_SOCIAL_CIRCLE How many observations of client's social surroundings defaulted on 60 DPD (days past due) Continuous N/A
DAYS_LAST_PHONE_CHANGE How many days before application did client change phone Continuous Days
FLAG_DOCUMENT_2 Did client provide document 2 Category N/A
FLAG_DOCUMENT_3 Did client provide document 3 Category N/A
FLAG_DOCUMENT_4 Did client provide document 4 Category N/A
FLAG_DOCUMENT_5 Did client provide document 5 Category N/A
FLAG_DOCUMENT_6 Did client provide document 6 Category N/A
FLAG_DOCUMENT_7 Did client provide document 7 Category N/A
FLAG_DOCUMENT_8 Did client provide document 8 Category N/A
FLAG_DOCUMENT_9 Did client provide document 9 Category N/A
FLAG_DOCUMENT_10 Did client provide document 10 Category N/A
FLAG_DOCUMENT_11 Did client provide document 11 Category N/A
FLAG_DOCUMENT_12 Did client provide document 12 Category N/A
FLAG_DOCUMENT_13 Did client provide document 13 Category N/A
FLAG_DOCUMENT_14 Did client provide document 14 Category N/A
FLAG_DOCUMENT_15 Did client provide document 15 Category N/A
FLAG_DOCUMENT_16 Did client provide document 16 Category N/A
FLAG_DOCUMENT_17 Did client provide document 17 Category N/A
FLAG_DOCUMENT_18 Did client provide document 18 Category N/A
FLAG_DOCUMENT_19 Did client provide document 19 Category N/A
FLAG_DOCUMENT_20 Did client provide document 20 Category N/A
FLAG_DOCUMENT_21 Did client provide document 21 Category N/A
AMT_REQ_CREDIT_BUREAU_HOUR Number of enquiries to Credit Bureau about the client one hour before application Category N/A
AMT_REQ_CREDIT_BUREAU_DAY Number of enquiries to Credit Bureau about the client one day before application (excluding one hour before application) Category N/A
AMT_REQ_CREDIT_BUREAU_WEEK Number of enquiries to Credit Bureau about the client one week before application (excluding one day before application) Category N/A
AMT_REQ_CREDIT_BUREAU_MON Number of enquiries to Credit Bureau about the client one month before application (excluding one week before application) Category N/A
AMT_REQ_CREDIT_BUREAU_QRT Number of enquiries to Credit Bureau about the client 3 months before application (excluding one month before application) Category N/A
AMT_REQ_CREDIT_BUREAU_YEAR Number of enquiries to Credit Bureau about the client one year before application (excluding last 3 months before application) Category N/A

Engineered Loan Application Features

The following features were engineered from the loan application features (from application_train.csv).

Engineered Feature Description Type Units Formula
CREDIT_INCOME_RATIO The ratio of the credit amount to the client's income Numeric N/A AMT_CREDIT / AMT_INCOME_TOTAL
ANNUITY_INCOME_RATIO The ratio of the annuity to the client's income Numeric N/A AMT_ANNUITY / AMT_INCOME_TOTAL
PERCENT_EMPLOYED_TO_AGE The fraction of the client's life spent employed Numeric N/A DAYS_EMPLOYED / DAYS_BIRTH
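The ratio features above reduce to simple pandas column arithmetic. The sketch below uses illustrative toy values, not real records from the dataset:

```python
import pandas as pd

# toy application rows (illustrative values, not real records)
apps = pd.DataFrame({
    'AMT_CREDIT': [406597.5, 135000.0],
    'AMT_INCOME_TOTAL': [202500.0, 67500.0],
    'AMT_ANNUITY': [24700.5, 6750.0],
    'DAYS_EMPLOYED': [-637, -225],
    'DAYS_BIRTH': [-9461, -19046],
})
apps['CREDIT_INCOME_RATIO'] = apps['AMT_CREDIT'] / apps['AMT_INCOME_TOTAL']
apps['ANNUITY_INCOME_RATIO'] = apps['AMT_ANNUITY'] / apps['AMT_INCOME_TOTAL']
# both DAYS_* columns count days before the application as negatives,
# so their ratio comes out positive
apps['PERCENT_EMPLOYED_TO_AGE'] = apps['DAYS_EMPLOYED'] / apps['DAYS_BIRTH']
```

Because both DAYS_EMPLOYED and DAYS_BIRTH are negative counts of days before the application, no sign correction is needed for the ratio.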

The following features were engineered from the bureau features (from bureau.csv).

Engineered Feature Description Type Units Formula
LOAN_COUNT The total number of accounts, active and closed. Numeric N/A length of CREDIT_ACTIVE, grouped by loan ID
CREDIT_ACTIVE A count of active credit accounts by loan ID Numeric N/A len(bureau['CREDIT_ACTIVE'] == 'Active')
CREDIT_DAY_OVERDUE A count of days overdue for active credit accounts by loan ID Numeric N/A sum of CREDIT_DAY_OVERDUE for CREDIT_ACTIVE == Active, grouped by loan ID
AMT_CREDIT_SUM Total credit available from active accounts Numeric N/A sum of AMT_CREDIT_SUM for CREDIT_ACTIVE == Active, grouped by loan ID
AMT_CREDIT_SUM_DEBT Total debt of active accounts Numeric N/A sum of AMT_CREDIT_SUM_DEBT for CREDIT_ACTIVE == Active, grouped by loan ID
AMT_CREDIT_SUM_LIMIT Overall credit limit of active accounts Numeric N/A sum of AMT_CREDIT_SUM_LIMIT for CREDIT_ACTIVE == Active, grouped by loan ID
AMT_CREDIT_SUM_OVERDUE Total amount overdue Numeric N/A sum of AMT_CREDIT_SUM_OVERDUE for CREDIT_ACTIVE == Active, grouped by loan ID
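The bureau aggregations in the table can be sketched with a groupby. The project's actual implementation is `create_newFeatures` in cleaning.py; this is only an illustrative stand-in on toy data:

```python
import pandas as pd

# toy bureau records: two clients, a mix of active and closed accounts
bureau = pd.DataFrame({
    'SK_ID_CURR': [1, 1, 1, 2],
    'CREDIT_ACTIVE': ['Active', 'Active', 'Closed', 'Closed'],
    'AMT_CREDIT_SUM': [100.0, 200.0, 50.0, 75.0],
})
# total accounts per client, active or closed
loan_count = bureau.groupby('SK_ID_CURR')['CREDIT_ACTIVE'].size().rename('LOAN_COUNT')
# sums over active accounts only
active = bureau[bureau['CREDIT_ACTIVE'] == 'Active']
active_sum = active.groupby('SK_ID_CURR')['AMT_CREDIT_SUM'].sum().rename('AMT_CREDIT_SUM')
# clients with no active accounts get 0 rather than NaN
features = pd.concat([loan_count, active_sum], axis=1).fillna(0)
```

Filtering to active accounts before grouping is what makes the sums "active only"; the `fillna(0)` covers clients whose accounts are all closed.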

2.3 Verify Data Quality

Note: All the resolutions to the data quality issues described in this section are implemented in cleaning.py, which is used to load the cleaned dataset.

2.3.1. Missing Data

In addition to incorrect datatypes, another common problem when dealing with real-world data is missing values. These can arise for many reasons and have to be either filled in or removed before we train a machine learning model. First, let's get a sense of how many missing values are in each column.

While we always want to be careful about removing information, a column with a high percentage of missing values probably will not be useful to our model. The threshold for removing columns should depend on the problem.
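A minimal version of such a missing-value summary might look like the sketch below; the project's `missing_values_table` in cleaning.py is the authoritative implementation, and the 50% threshold here is just an example:

```python
import numpy as np
import pandas as pd

def missing_summary(df, drop_threshold=50.0):
    """Percent missing per column, plus the columns exceeding a drop threshold."""
    pct = (df.isnull().mean() * 100).round(1)
    pct = pct[pct > 0].sort_values(ascending=False)
    to_drop = pct[pct > drop_threshold].index.tolist()
    return pct, to_drop

# toy frame: 'a' is 75% missing, 'b' is 25% missing, 'c' is complete
demo = pd.DataFrame({'a': [1, np.nan, np.nan, np.nan],
                     'b': [1, 2, 3, np.nan],
                     'c': [1, 2, 3, 4]})
pct, to_drop = missing_summary(demo, drop_threshold=50.0)
```

Only column 'a' exceeds the 50% threshold, so only it would be flagged for removal.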

In [7]:
# create missing values table with all rows
pd.set_option('display.max_rows', 122)
missing_values_table(df)
Your selected dataframe has 122 columns.
There are 69 columns that have missing values.
Out[7]:
Missing Values % of Total Values
COMMONAREA_AVG 214865 69.9
COMMONAREA_MEDI 214865 69.9
COMMONAREA_MODE 214865 69.9
NONLIVINGAPARTMENTS_MEDI 213514 69.4
NONLIVINGAPARTMENTS_AVG 213514 69.4
NONLIVINGAPARTMENTS_MODE 213514 69.4
FONDKAPREMONT_MODE 210295 68.4
LIVINGAPARTMENTS_AVG 210199 68.4
LIVINGAPARTMENTS_MEDI 210199 68.4
LIVINGAPARTMENTS_MODE 210199 68.4
FLOORSMIN_MEDI 208642 67.8
FLOORSMIN_AVG 208642 67.8
FLOORSMIN_MODE 208642 67.8
YEARS_BUILD_MODE 204488 66.5
YEARS_BUILD_MEDI 204488 66.5
YEARS_BUILD_AVG 204488 66.5
OWN_CAR_AGE 202929 66.0
LANDAREA_MEDI 182590 59.4
LANDAREA_AVG 182590 59.4
LANDAREA_MODE 182590 59.4
BASEMENTAREA_MODE 179943 58.5
BASEMENTAREA_MEDI 179943 58.5
BASEMENTAREA_AVG 179943 58.5
EXT_SOURCE_1 173378 56.4
NONLIVINGAREA_MODE 169682 55.2
NONLIVINGAREA_MEDI 169682 55.2
NONLIVINGAREA_AVG 169682 55.2
ELEVATORS_MODE 163891 53.3
ELEVATORS_AVG 163891 53.3
ELEVATORS_MEDI 163891 53.3
WALLSMATERIAL_MODE 156341 50.8
APARTMENTS_MODE 156061 50.7
APARTMENTS_MEDI 156061 50.7
APARTMENTS_AVG 156061 50.7
ENTRANCES_AVG 154828 50.3
ENTRANCES_MEDI 154828 50.3
ENTRANCES_MODE 154828 50.3
LIVINGAREA_MEDI 154350 50.2
LIVINGAREA_MODE 154350 50.2
LIVINGAREA_AVG 154350 50.2
HOUSETYPE_MODE 154297 50.2
FLOORSMAX_AVG 153020 49.8
FLOORSMAX_MODE 153020 49.8
FLOORSMAX_MEDI 153020 49.8
YEARS_BEGINEXPLUATATION_MEDI 150007 48.8
YEARS_BEGINEXPLUATATION_MODE 150007 48.8
YEARS_BEGINEXPLUATATION_AVG 150007 48.8
TOTALAREA_MODE 148431 48.3
EMERGENCYSTATE_MODE 145755 47.4
OCCUPATION_TYPE 96391 31.3
EXT_SOURCE_3 60965 19.8
ORGANIZATION_TYPE 55374 18.0
AMT_REQ_CREDIT_BUREAU_MON 41519 13.5
AMT_REQ_CREDIT_BUREAU_WEEK 41519 13.5
AMT_REQ_CREDIT_BUREAU_QRT 41519 13.5
AMT_REQ_CREDIT_BUREAU_DAY 41519 13.5
AMT_REQ_CREDIT_BUREAU_HOUR 41519 13.5
AMT_REQ_CREDIT_BUREAU_YEAR 41519 13.5
NAME_TYPE_SUITE 1292 0.4
OBS_30_CNT_SOCIAL_CIRCLE 1021 0.3
DEF_30_CNT_SOCIAL_CIRCLE 1021 0.3
OBS_60_CNT_SOCIAL_CIRCLE 1021 0.3
DEF_60_CNT_SOCIAL_CIRCLE 1021 0.3
EXT_SOURCE_2 660 0.2
AMT_GOODS_PRICE 278 0.1
AMT_ANNUITY 12 0.0
CODE_GENDER 4 0.0
CNT_FAM_MEMBERS 2 0.0
DAYS_LAST_PHONE_CHANGE 1 0.0
In [8]:
# return row display setting to default
pd.set_option('display.max_rows', pd_max_rows_default)

Analysis of Missing Data in Each Feature

Each feature or set of features will be discussed in descending order of the rate of missing values.

Building Features with Missing Values
A large number of the features with missing values are normalized information about the building where the client lives (such as ENTRANCES_MODE or COMMONAREA_AVG). It is plausible that only some of these features apply to a given client's building. It is also plausible that some of the records are missing (not provided by the client). There does not appear to be an indication of whether values are missing or simply not applicable to the client. All of these features have a missing rate above 40%.

Missing values in OWN_CAR_AGE
This feature refers to the age of the car owned by the client. Approximately 66.0% of the values are empty. However, there is also a feature, FLAG_OWN_CAR, which indicates whether the client owns a car. It is reasonable to expect that values will be missing if the client does not own a car. For clients reporting owning a car, all but 5 car age values are present. For clients reporting not owning a car, no car ages are reported. Since the effective missing rate is very small, these missing values could be imputed.

In [9]:
# get indexes of OWN_CAR_AGE that are not NA
car_age_indexes = df[df.OWN_CAR_AGE.notnull()].index
# get indexes of FLAG_OWN_CAR with value equal to Y (client owns car)
owns_car_indexes = df.query('FLAG_OWN_CAR == "Y"').index
print('There are {} records with ages for cars'.format(car_age_indexes.shape[0]))
print('There are {} records indicating that the client owns a car'.format(owns_car_indexes.shape[0]))
# get the number of non-null car age values on records where client does not list a car
car_age_not_own_car = df.query('FLAG_OWN_CAR == "N"').OWN_CAR_AGE.notnull().sum()
print('There are {} car ages reported for clients that report NOT owning a car'.format(car_age_not_own_car))
There are 104582 records with ages for cars
There are 104587 records indicating that the client owns a car
There are 0 car ages reported for clients that report NOT owning a car
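Given that the 5 gaps among car owners are the only true missing values, one option is conditional imputation within owners only. This is a sketch assuming median imputation (the document leaves the choice open), on toy data:

```python
import numpy as np
import pandas as pd

cars = pd.DataFrame({
    'FLAG_OWN_CAR': ['Y', 'Y', 'Y', 'N', 'N'],
    'OWN_CAR_AGE': [5.0, np.nan, 9.0, np.nan, np.nan],
})
owners = cars['FLAG_OWN_CAR'] == 'Y'
# impute only within owners; non-owners legitimately have no car age
median_age = cars.loc[owners, 'OWN_CAR_AGE'].median()
cars.loc[owners, 'OWN_CAR_AGE'] = cars.loc[owners, 'OWN_CAR_AGE'].fillna(median_age)
```

Non-owners keep NaN, preserving the distinction between "no car" and "car age not reported".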

Missing values in EXT_SOURCE_1, EXT_SOURCE_2, and EXT_SOURCE_3
Since there are zeros in these columns, we expect that the missing values represent the lack of the external source for these clients. Therefore, we will add an encoding feature EXT_SOURCE_<number>_AV for each of the external source features that represents the presence of external information. The original columns will only be used as an interaction with these encoding features, and the missing values will be filled with zero.

  • EXT_SOURCE_1 has a missing rate of 56.4%
  • EXT_SOURCE_2 has a missing rate of 0.2%
  • EXT_SOURCE_3 has a missing rate of 19.8%
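The availability encoding described above can be sketched as follows (shown for one column; the `_AV` suffix follows the convention stated here, and the data are toy values):

```python
import numpy as np
import pandas as pd

scores = pd.DataFrame({'EXT_SOURCE_1': [0.5, np.nan, 0.2, np.nan]})
for col in ['EXT_SOURCE_1']:
    # 1 = external score available, 0 = missing
    scores[col + '_AV'] = scores[col].notnull().astype(int)
    # fill missing scores with zero; the score is then used only as an
    # interaction with the _AV flag
    scores[col] = scores[col].fillna(0)
```

The flag must be computed before the fill, otherwise the distinction between a true zero score and a filled-in zero is lost.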

Missing values in OCCUPATION_TYPE
There does not appear to be an indication that OCCUPATION_TYPE is systematically missing. We will assume that the client did not provide this information and impute with a new categorical level, Unknown. OCCUPATION_TYPE has a missing rate of 31.3%.

Missing values in ORGANIZATION_TYPE
Later in the outliers section, it is shown that the NAs in ORGANIZATION_TYPE are associated with occupations listed as 'Unemployed' or 'Pensioner'. We will take this to mean that these clients are not associated with any employer and will impute these NAs with 'None'. ORGANIZATION_TYPE has a missing rate of 18%.

Missing values for AMT_REQ_CREDIT_BUREAU Features
These features indicate the number of credit enquiries on the client in a given time period: hour, day, week, month, quarter, and year. There does not appear to be a systematic reason for the missing values in these features. We will treat these records as if there are no credit enquiries and impute with zero. These features have a missing rate of 13.5%.

Remaining Features with Low Missing Rate
The remaining features are missing at rates below 0.5%. Imputation on these features should have only a small impact on the total dataset. We will use standard imputation strategies: imputation of the mode for categorical features and imputation of the median for continuous features. We impute continuous features with the median rather than the mean because these features are skewed, and the median is not affected by extreme values.
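The mode/median strategy can be sketched as follows; the actual imputation lives in cleaning.py, and these are toy values:

```python
import numpy as np
import pandas as pd

low_missing = pd.DataFrame({
    'NAME_TYPE_SUITE': ['Family', 'Family', np.nan, 'Children'],  # categorical
    'AMT_GOODS_PRICE': [100.0, 200.0, np.nan, 1000.0],            # skewed continuous
})
# categorical: impute the mode (most frequent level)
mode = low_missing['NAME_TYPE_SUITE'].mode()[0]
low_missing['NAME_TYPE_SUITE'] = low_missing['NAME_TYPE_SUITE'].fillna(mode)
# continuous: impute the median, which is robust to the right skew
median = low_missing['AMT_GOODS_PRICE'].median()
low_missing['AMT_GOODS_PRICE'] = low_missing['AMT_GOODS_PRICE'].fillna(median)
```

Note the mean of the toy price column (433.33) would be pulled upward by the 1000.0 value, while the median (200.0) is not, which is exactly the skew argument above.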

In [10]:
pd.set_option('display.max_rows', 122)
data = read_clean_data()
missing_values_table(data)
Your selected dataframe has 82 columns.
There are 0 columns that have missing values.
Out[10]:
Missing Values % of Total Values
In [11]:
#engineering features from bureau dataset
newFeatures = create_newFeatures(bureau)
newFeatures.head()
newFeatures = newFeatures.fillna(0)
In [12]:
data = data.merge(newFeatures, on = 'SK_ID_CURR', how = 'left')
#data = data.fillna(0)
data.head()
Out[12]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... CREDIT_INCOME_RATIO ANNUITY_INCOME_RATIO PERCENT_EMPLOYED_TO_AGE LOAN_COUNT CREDIT_ACTIVE CREDIT_DAY_OVERDUE AMT_CREDIT_SUM AMT_CREDIT_SUM_DEBT AMT_CREDIT_SUM_LIMIT AMT_CREDIT_SUM_OVERDUE
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 2.007889 0.121978 0.067329 8.0 2.0 0.0 481988.565 245781.0 31988.565 0.0
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 4.790750 0.132217 0.070862 4.0 1.0 0.0 810000.000 0.0 810000.000 0.0
2 100004 0 Revolving loans M Y Y 0 67500.0 135000.0 6750.0 ... 2.000000 0.100000 0.011814 2.0 0.0 0.0 0.000 0.0 0.000 0.0
3 100006 0 Cash loans F N Y 0 135000.0 312682.5 29686.5 ... 2.316167 0.219900 0.159905 NaN NaN NaN NaN NaN NaN NaN
4 100007 0 Cash loans M N Y 0 121500.0 513000.0 21865.5 ... 4.222222 0.179963 0.152418 1.0 0.0 0.0 0.000 0.0 0.000 0.0

5 rows × 89 columns
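The NaNs in row 3 of the merged output arise because a left merge keeps applications with no bureau history. Treating those clients as having zero prior loans is one option, sketched here on toy rows:

```python
import pandas as pd

apps = pd.DataFrame({'SK_ID_CURR': [100002, 100006], 'TARGET': [1, 0]})
feats = pd.DataFrame({'SK_ID_CURR': [100002], 'LOAN_COUNT': [8.0]})
# left merge keeps every application, even those without bureau records
merged = apps.merge(feats, on='SK_ID_CURR', how='left')
# client 100006 has no bureau record, so LOAN_COUNT is NaN;
# interpret no bureau history as zero prior loans
merged['LOAN_COUNT'] = merged['LOAN_COUNT'].fillna(0)
```

Whether zero is the right fill for every engineered bureau column is a modeling judgment; a separate "has bureau history" indicator would preserve the distinction.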

2.3.2. Outliers

At this point, we may also want to remove outliers. These can be due to typos in data entry, mistakes in units, or they could be legitimate but extreme values. For this project, we will remove anomalies based on the definition of extreme outliers:

https://www.itl.nist.gov/div898/handbook/prc/section1/prc16.htm

  • Below the first quartile − 3 ∗ interquartile range
  • Above the third quartile + 3 ∗ interquartile range
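These fences can be computed directly with NumPy; the toy array below has one value beyond the upper fence:

```python
import numpy as np

def extreme_outlier_bounds(values):
    """NIST extreme-outlier fences: Q1 - 3*IQR and Q3 + 3*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - 3 * iqr, q3 + 3 * iqr

vals = np.array([1, 2, 3, 4, 5, 100])
low, high = extreme_outlier_bounds(vals)
outliers = vals[(vals < low) | (vals > high)]
```

For this array Q1 = 2.25 and Q3 = 4.75 (linear interpolation), so the fences are -5.25 and 12.25, and only 100 is flagged.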

High Values in DAYS_EMPLOYED

There are a large number of entries for DAYS_EMPLOYED outside the main distribution. These entries are at value 365243 and there are 55374 instances.

In [13]:
df.DAYS_EMPLOYED.hist(bins = 50);
In [14]:
df.query("DAYS_EMPLOYED >= 100000").DAYS_EMPLOYED.shape
Out[14]:
(55374,)
In [15]:
df.query("DAYS_EMPLOYED >= 100000").DAYS_EMPLOYED.head()
Out[15]:
8     365243
11    365243
23    365243
38    365243
43    365243
Name: DAYS_EMPLOYED, dtype: int64

These high values for DAYS_EMPLOYED appear to be associated with clients that are "Pensioners" or "Unemployed" and do not list an employment organization (ORGANIZATION_TYPE).

In [16]:
# get the instances with NAME_INCOME_TYPE either Pensioner or Unemployed
pensioner_unemployed = df.query('NAME_INCOME_TYPE == "Pensioner" | NAME_INCOME_TYPE == "Unemployed"')
# within that subset, filter to NAs for ORGANIZATION_TYPE and take the index
filtered_index = pensioner_unemployed[pensioner_unemployed.ORGANIZATION_TYPE.isna()].index
# get the indexes of the high values
high_val_index = df.query("DAYS_EMPLOYED >= 100000").index
# assert that indexes are the same
if np.equal(filtered_index, high_val_index).all():
    print('Index of queried values are the same.')
else:
    print('Indexes of queried values are different.')
Index of queried values are the same.

Possible Solution

A possible solution to this data quality issue would be to add a None level to ORGANIZATION_TYPE and only use DAYS_EMPLOYED as an interaction with ORGANIZATION_TYPE.
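One way to implement this is to flag the sentinel and neutralize it, consistent with the EMPLOYED column that appears in the cleaned data shown later (the exact recode in cleaning.py may differ; this is a sketch on toy rows):

```python
import pandas as pd

SENTINEL = 365243  # placeholder value observed for pensioner/unemployed clients

emp = pd.DataFrame({'DAYS_EMPLOYED': [-1000, SENTINEL, -500, SENTINEL]})
# flag genuine employment records before overwriting the sentinel
emp['EMPLOYED'] = (emp['DAYS_EMPLOYED'] != SENTINEL).astype(int)
# replace the sentinel so DAYS_EMPLOYED can be used as an interaction
# with the EMPLOYED flag
emp['DAYS_EMPLOYED'] = emp['DAYS_EMPLOYED'].replace(SENTINEL, 0)
```

The flag must be created before the replacement, otherwise the sentinel rows become indistinguishable from clients employed for zero days.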

Data Cleaning Script

All the cleaning discussed in the sections above is implemented in cleaning.py. This script contains a function, read_clean_data, that applies the cleaning steps and returns the cleaned dataset for analysis.

Details

  • Cleaning
    • Read the csv with pandas (setting correct data types)
    • Drop columns that will not be used
    • Recode NA values that are not listed as np.nan
    • Apply formatting
    • Encode categorical variables
  • Returns
    • DataFrame with cleaned data

Sample Output

In [17]:
data = read_clean_data()
data.head(2)
Out[17]:
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY ... AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR EMPLOYED EXT_SOURCE_1_AV EXT_SOURCE_2_AV EXT_SOURCE_3_AV CREDIT_INCOME_RATIO ANNUITY_INCOME_RATIO PERCENT_EMPLOYED_TO_AGE
0 100002 1 Cash loans M N Y 0 202500.0 406597.5 24700.5 ... 0 0 1 1 1 1 1 2.007889 0.121978 0.067329
1 100003 0 Cash loans F N N 0 270000.0 1293502.5 35698.5 ... 0 0 0 1 1 1 0 4.790750 0.132217 0.070862

2 rows × 82 columns

2.4 Initial Data Exploration

During this stage you'll address data mining questions using querying, data visualization and reporting techniques. These may include:

  • Distribution of key attributes (for example, the target attribute of a prediction task)
  • Relationships between pairs or small numbers of attributes
  • Results of simple aggregations
  • Properties of significant sub-populations
  • Simple statistical analyses

These analyses may directly address your data mining goals. They may also contribute to or refine the data description and quality reports, and feed into the transformation and other data preparation steps needed for further analysis.

  • Data exploration report - Describe results of your data exploration, including first findings or initial hypothesis and their impact on the remainder of the project. If appropriate you could include graphs and plots here to indicate data characteristics that suggest further examination of interesting data subsets.

We used pandas-profiling to explore the dataset's range, mode, mean, median, variance, counts, etc. After this initial assessment, we will focus on the most important attributes, describe what they mean, and add interesting observations we have noted for the dataset. NOTE: Install pandas_profiling into your Conda environment by running "conda install -c conda-forge pandas-profiling"

In [18]:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport

# Generate the pandas_profile report
profile = ProfileReport(data, title='Pandas Profiling Report')
#Run interactive report within Jupyter
profile
Report generated with pandas-profiling.
Out[18]:

In [22]:
profile.to_file(output_file="output.html")
In [23]:
profile.to_notebook_iframe()
In [ ]:
profile = ProfileReport(data, minimal=True)
profile.to_file(output_file="output.html")

2.4.1 Distributions

In [19]:
def count_values_table(df):
    # tabulate counts and percentage of total for each value
    count_val = df.value_counts()
    count_val_percent = 100 * count_val / len(df)
    count_val_table = pd.concat([count_val, count_val_percent.round(1)],
                                axis=1, keys=['Count Values', '% of Total Values'])
    return count_val_table
In [20]:
# Histogram
import matplotlib.pyplot as plt

def hist_chart(df, col):
    plt.style.use('fivethirtyeight')
    plt.hist(df[col].dropna(), edgecolor='k')
    plt.xlabel(col)
    plt.ylabel('Number of Entries')
    plt.title('Distribution of ' + col)
In [21]:
col = 'account_risk_band'
# Histogram & Results
hist_chart(df, col)
count_values_table(df.account_risk_band)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2896             try:
-> 2897                 return self._engine.get_loc(key)
   2898             except KeyError:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'account_risk_band'

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-21-5f924fb5c1c6> in <module>
      1 col = 'account_risk_band'
      2 # Histogram & Results
----> 3 hist_chart(df, col)
      4 count_values_table(df.account_risk_band)

<ipython-input-20-9efa23ef357f> in hist_chart(df, col)
      2 def hist_chart(df, col):
      3         plt.style.use('fivethirtyeight')
----> 4         plt.hist(df[col].dropna(), edgecolor = 'k');
      5         plt.xlabel(col); plt.ylabel('Number of Entries');
      6         plt.title('Distribution of '+col);

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
   2993             if self.columns.nlevels > 1:
   2994                 return self._getitem_multilevel(key)
-> 2995             indexer = self.columns.get_loc(key)
   2996             if is_integer(indexer):
   2997                 indexer = [indexer]

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexes\base.py in get_loc(self, key, method, tolerance)
   2897                 return self._engine.get_loc(key)
   2898             except KeyError:
-> 2899                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2900         indexer = self.get_indexer([key], method=method, tolerance=tolerance)
   2901         if indexer.ndim > 1 or indexer.size > 1:

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas\_libs\hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'account_risk_band'

2.4.2 Correlations

Can we derive any correlations from this dataset? A seaborn pairplot shows pairwise correlations, distributions, and regression fits in a single figure. Correlograms are excellent for exploratory analysis: they let you quickly observe the relationship between every pair of variables in your matrix. Seaborn makes this easy - just call the pairplot function.

Pairplot documentation can be found here: https://seaborn.pydata.org/generated/seaborn.pairplot.html

In [ ]:
# Seaborn makes a correlogram (correlation matrix plot) straightforward.
# 'x' and 'y' are placeholders for a column to drop and the hue column.
#sns.pairplot(df.dropna().drop(['x'], axis=1), hue='y', kind='reg')

#plt.show()
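Since the pairplot call above is commented out with placeholder column names, here is a minimal sketch of the same idea using plain pandas (no seaborn dependency): computing a numeric correlation matrix. The column names (`income`, `credit`, `age`) and values are made up for illustration; `credit` is constructed to be perfectly linear in `income` so its Pearson correlation is 1.

```python
import pandas as pd

# Hypothetical numeric columns; in practice use the loan DataFrame's numeric features
df_num = pd.DataFrame({
    'income': [50, 60, 70, 80],
    'credit': [100, 120, 140, 160],  # exactly 2 * income, so correlation is 1
    'age':    [40, 35, 50, 30],
})

corr = df_num.corr()  # Pearson correlation matrix
```

A pairplot visualizes these same pairwise relationships; the matrix gives the numbers directly.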
In [ ]:
#df_agg = df.drop(['x'], axis=1).groupby(['y']).sum()
df_agg = df.groupby(['y']).sum()

Differencing

In differencing, a new series is constructed where the value at the current time step is calculated as the difference between the original observation and the observation at the previous time step: value(t) = observation(t) - observation(t-1)

In [ ]:
df_dif_agg = df_agg
In [ ]:
# Difference the aggregated series: value(t) - value(t-1)
df_dif = df_dif_agg.diff()
In [ ]:
print(df_dif)
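The differencing formula above maps directly onto pandas' `Series.diff`. A minimal sketch on a made-up series (not the project data):

```python
import pandas as pd

s = pd.Series([10, 12, 15, 11])
diffed = s.diff()  # value(t) - value(t-1); the first element has no predecessor, so it is NaN
# diffed -> [NaN, 2.0, 3.0, -4.0]
```

`DataFrame.diff`, as used on `df_dif_agg` above, applies the same operation column by column.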

2.5 Data Quality Report

List the results of the data quality verification. If quality problems exist, suggest possible solutions. Solutions to data quality problems generally depend heavily on both data and business knowledge.

In [ ]:
 

3. Stage Three - Data Preparation

This is the stage of the project where you construct the final dataset from the initial raw data. Data preparation tasks include selecting, cleaning, constructing, integrating, and formatting data; they are likely to be performed multiple times and not in any prescribed order.

3.1 Select Your Data

This is the stage of the project where you decide on the data that you're going to use for analysis. The criteria you might use to make this decision include the relevance of the data to your data mining goals, the quality of the data, and also technical constraints such as limits on data volume or data types. Note that data selection covers selection of attributes (columns) as well as selection of records (rows) in a table.

Rationale for inclusion/exclusion - List the data to be included/excluded and the reasons for these decisions.

In [ ]:
X_train_regr = df.drop(['date_maint', 'account_open_date'], axis = 1)
X_train = df.drop(['target', 'date_maint', 'account_open_date'], axis = 1)
X_test = test.drop(['date_maint', 'account_open_date'], axis = 1)

3.2 Clean The Data

This task involves raising the data quality to the level required by the analysis techniques that you've selected. This may involve selecting clean subsets of the data, the insertion of suitable defaults, or more ambitious techniques such as the estimation of missing data by modelling.

3.2.1 Label Encoding

Label Encoding to turn Categorical values to Integers

An approach to encoding categorical values is to use a technique called label encoding. Label encoding is simply converting each value in a column to a number. For example, the body_style column contains 5 different values. We could choose to encode it like this:

convertible -> 0
hardtop -> 1
hatchback -> 2
sedan -> 3
wagon -> 4

Reference: http://pbpython.com/categorical-encoding.html
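The body_style mapping above can be sketched with pandas' category codes, which assign integers in sorted category order - here that happens to match the mapping exactly. The `body_style` values are from the example; the DataFrame itself is made up for illustration.

```python
import pandas as pd

cars = pd.DataFrame({'body_style': ['sedan', 'convertible', 'wagon', 'hatchback', 'hardtop']})
cars['body_style'] = cars['body_style'].astype('category')
# cat.codes numbers categories alphabetically: convertible=0, hardtop=1, hatchback=2, sedan=3, wagon=4
cars['body_style_code'] = cars['body_style'].cat.codes
```

The sklearn `LabelEncoder` used in the next cell produces the same kind of integer mapping, with the added benefit of a reusable fitted encoder for the test set.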

In [ ]:
from sklearn.preprocessing import LabelEncoder

# CAT_COLS is assumed to hold the list of categorical column names
for col in CAT_COLS:
    encoder = LabelEncoder()
    X_train[col] = encoder.fit_transform(X_train[col].astype(str))
    X_test[col] = encoder.transform(X_test[col].astype(str))
In [ ]:
df["column"] = df["column"].astype('category')
df.dtypes
In [ ]:
df["column"] = df["column"].cat.codes
df.head()

3.2.2 Drop Unnecessary Columns

Sometimes we may not need certain columns. We can drop them to keep only relevant data.

In [ ]:
del_col_list = ['col1', 'col2']

df = df.drop(del_col_list, axis=1)
df.head()

3.2.3 Altering Data Types

Sometimes we may need to alter data types, including converting to and from object dtypes.

In [ ]:
#df['date'] = pd.to_datetime(df['date'])
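A minimal self-contained sketch of the `pd.to_datetime` conversion commented out above, using a made-up two-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'date': ['2020-01-01', '2020-02-15']})
df['date'] = pd.to_datetime(df['date'])  # object (string) -> datetime64[ns]
```

Once converted, the `.dt` accessor exposes components such as `df['date'].dt.year` for feature construction.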

3.2.4 Dealing With Zeros

Replace all zeros in the listed columns with NaN. Note: you may not want to do this - add/remove columns as required.

In [ ]:
#cols = ['col1', 'col2']
#df[cols] = df[cols].replace(0, np.nan)
In [ ]:
# dropping all the rows with na in the columns mentioned above in the list.

# df.dropna(subset=cols, inplace=True)
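Putting the two commented steps above together on a made-up DataFrame: replace zeros with NaN, then drop rows that are missing in those columns.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 0, 3], 'b': [4, 5, 0]})
cols = ['a', 'b']
df[cols] = df[cols].replace(0, np.nan)   # treat zeros in these columns as missing
cleaned = df.dropna(subset=cols)         # drop rows with NaN in any of the listed columns
# only the first row survives: rows 2 and 3 each had a zero
```

Whether a zero really means "missing" depends on the column; for a balance of 0 it is a legitimate value, which is why this step is optional.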

3.2.5 Dealing With Duplicates

Remove duplicate rows. Note: you may not want to do this - add/remove as required.

In [ ]:
#df = df.drop_duplicates(keep='first')
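A minimal sketch of the commented `drop_duplicates` call on a made-up DataFrame with one fully duplicated row:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2], 'value': [5, 5, 7]})
deduped = df.drop_duplicates(keep='first')  # keep the first of each identical row
# rows (1, 5) appear twice; one copy is dropped, leaving 2 rows
```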

3.3 Construct Required Data

This task includes constructive data preparation operations such as the production of derived attributes or entire new records, or transformed values for existing attributes.

Derived attributes - These are new attributes that are constructed from one or more existing attributes in the same record, for example you might use the variables of length and width to calculate a new variable of area.

Generated records - Here you describe the creation of any completely new records. For example you might need to create records for customers who made no purchase during the past year. There was no reason to have such records in the raw data, but for modelling purposes it might make sense to explicitly represent the fact that particular customers made zero purchases.
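The length-and-width example above can be sketched directly in pandas; the column names and values here are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'length': [2.0, 3.0], 'width': [4.0, 5.0]})
df['area'] = df['length'] * df['width']  # derived attribute from two existing attributes
# area -> [8.0, 15.0]
```

For this dataset, analogous derived attributes might be ratios of loan amount to income; which derivations pay off depends on the modelling stage.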

In [ ]:
 

3.4 Integrate Data

These are methods whereby information is combined from multiple databases, tables or records to create new records or values.

Merged data - Merging tables refers to joining together two or more tables that have different information about the same objects. For example a retail chain might have one table with information about each store’s general characteristics (e.g., floor space, type of mall), another table with summarised sales data (e.g., profit, percent change in sales from previous year), and another with information about the demographics of the surrounding area. Each of these tables contains one record for each store. These tables can be merged together into a new table with one record for each store, combining fields from the source tables.

Aggregations - Aggregations refers to operations in which new values are computed by summarising information from multiple records and/or tables. For example, converting a table of customer purchases where there is one record for each purchase into a new table where there is one record for each customer, with fields such as number of purchases, average purchase amount, percent of orders charged to credit card, percent of items under promotion etc.
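The store example above maps onto a pandas groupby-aggregate followed by a merge. The tables below (`stores`, `sales`) and their columns are made up to mirror the description, not taken from the project data.

```python
import pandas as pd

# One record per store with general characteristics
stores = pd.DataFrame({'store_id': [1, 2], 'floor_space': [1200, 800]})
# One record per purchase
sales = pd.DataFrame({'store_id': [1, 1, 2], 'amount': [100.0, 50.0, 75.0]})

# Aggregation: summarise purchases into one row per store
agg = sales.groupby('store_id', as_index=False).agg(
    n_purchases=('amount', 'size'),
    avg_amount=('amount', 'mean'),
)

# Merge: combine store characteristics with the aggregated sales
merged = stores.merge(agg, on='store_id', how='left')
```

The same pattern applies to joining bureau.csv, previous_application.csv, etc. to the application table on their key columns after aggregating to one row per loan.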

In [ ]:
 

Construct Our Primary Data Set

Join data

In [ ]:
 

4. Stage Four - Exploratory Data Analysis

5. Stage Five - Modelling

As the first step in modelling, you'll select the actual modelling technique that you'll be using. Although you may have already selected a tool during the business understanding phase, at this stage you'll be selecting the specific modelling technique e.g. decision-tree building with C5.0, or neural network generation with back propagation. If multiple techniques are applied, perform this task separately for each technique.

5.1. Modelling technique

Document the actual modelling technique that is to be used.

Import Models below:

In [ ]:
 

5.2. Modelling assumptions

Many modelling techniques make specific assumptions about the data, for example that all attributes have uniform distributions, no missing values allowed, class attribute must be symbolic etc. Record any assumptions made.

-

5.3. Build Model

Run the modelling tool on the prepared dataset to create one or more models.

Parameter settings - With any modelling tool there are often a large number of parameters that can be adjusted. List the parameters and their chosen values, along with the rationale for the choice of parameter settings.

Models - These are the actual models produced by the modelling tool, not a report on the models.

Model descriptions - Describe the resulting models, report on the interpretation of the models and document any difficulties encountered with their meanings.

In [ ]:
 

5.4. Assess Model

Interpret the models according to your domain knowledge, your data mining success criteria and your desired test design. Judge the success of the application of modelling and discovery techniques technically, then contact business analysts and domain experts later in order to discuss the data mining results in the business context. This task only considers models, whereas the evaluation phase also takes into account all other results that were produced in the course of the project.

At this stage you should rank the models and assess them according to the evaluation criteria. You should take the business objectives and business success criteria into account as far as you can here. In most data mining projects a single technique is applied more than once and data mining results are generated with several different techniques.

Model assessment - Summarise the results of this task, list the qualities of your generated models (e.g.in terms of accuracy) and rank their quality in relation to each other.

Revised parameter settings - According to the model assessment, revise parameter settings and tune them for the next modelling run. Iterate model building and assessment until you strongly believe that you have found the best model(s). Document all such revisions and assessments.

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: